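# mean population and fatalities over years, by state
mapping <- Fatalities %>%
  select(state, year, pop, fatal) %>%
  mutate(state = toupper(state))  # 'USA-states' location mode expects upper-case state codes
map <- mapping %>%
  group_by(state) %>%
  summarise(pop = mean(pop), fatal = mean(fatal))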
## `summarise()` ungrouping output (override with `.groups` argument)
map$hover <- with(map, paste(state, '<br>', "pop", pop, "fatal", fatal))
# give state boundaries a white border
l <- list(color = toRGB("white"), width = 2)
# specify some map projection/options
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)
fig <- plot_geo(map, locationmode = 'USA-states')
fig <- fig %>% add_trace(
  z = ~fatal, text = ~hover, locations = ~state,
  color = ~fatal, colors = 'Purples',
  marker = list(line = l)  # apply the white state borders defined in l
)
fig <- fig %>% colorbar(title = "Mean annual vehicle fatalities")
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
fig <- fig %>% layout(
  title = 'US Traffic Fatalities',
  geo = g,
  yaxis = list(type = "log")
)
fig
# by year
gg <- ggplot(mapping, aes(pop, fatal, color = state)) +
  geom_point(aes(size = fatal, frame = year, ids = state)) +
  scale_x_log10()
## Warning: Ignoring unknown aesthetics: frame, ids
ggplotly(gg)
Note:
- The report might be shown to the class as examples if it is done very well or very badly, with your identity redacted.
- Each student needs to write and submit the report on their own for Project 1.
- Remove these remarks in your submission.
In this section, state the questions of interest, the motivation for this analysis, and the potential impact of your results. You can simply rephrase the Project Description with minimal effort. You can also cite published papers or credible articles on the Internet. For instance, you may find this brief very relevant. More can be found by searching the key words “class size”, “education”, and “performance”. See, among others, here for proper citation formats.
In this section, explain the source of data, target population, sampling mechanism, and variables in this data set. You can briefly review existing research or known results, which will help you in the analysis. You can find the data set from many sources, e.g., the AER package or the Harvard dataverse. Both links provide information about this dataset. The brief mentioned in the previous section is also a good reference to read. You can find more by searching the key word “Project STAR” in, e.g., Google Scholar.
Select the variables you find relevant based on your understanding in the Background section. Summarize univariate descriptive statistics for the selected variables (mean, standard deviations, missing values, quantiles, etc.). You can create the table using functions in base R, or use packages (see, e.g., this note).
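As a starting point, the univariate summaries listed above can be sketched in base R. The data frame `dat` below is simulated, and its column names (`math1`, `star1`, `experience1`) are assumptions for illustration; replace them with the variables you actually select.

```r
# Toy data standing in for the selected variables.
# Column names and all values are illustrative assumptions, not STAR data.
set.seed(1)
dat <- data.frame(
  math1 = c(rnorm(48, mean = 530, sd = 40), NA, NA),
  star1 = factor(sample(c("small", "regular", "regular+aide"), 50, replace = TRUE)),
  experience1 = sample(0:30, 50, replace = TRUE)
)

# Summary for one numeric variable: mean, SD, quartiles, and missing count
describe <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE),
    n_missing = sum(is.na(x)))
}

# One row per numeric variable, one column per statistic
num_vars <- c("math1", "experience1")
desc_table <- t(sapply(dat[num_vars], describe))
print(round(desc_table, 2))

# Frequency table for a categorical variable, showing NAs if there are any
print(table(dat$star1, useNA = "ifany"))
```

Reporting the missing-value count next to the other statistics makes it easier to see how many observations each summary is based on.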
From the data set, we can easily see that the number of students assigned to each teacher varies. In order to obtain one summary measure with teacher as the unit, we need to aggregate students’ performance (their math scores in 1st grade).
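One possible way to do this aggregation is with `group_by()` and `summarise()` from dplyr. This is only a sketch on simulated data: `teacher_id`, `school_id`, `star1`, and `math1` are hypothetical column names, and the mean is just one candidate summary measure (the median is shown alongside it for comparison).

```r
library(dplyr)

# Toy student-level data; column names and values are illustrative assumptions.
set.seed(2)
students <- data.frame(
  teacher_id = rep(paste0("T", 1:6), times = c(13, 15, 22, 24, 21, 23)),
  math1 = round(rnorm(118, mean = 530, sd = 40))
)
# Each teacher belongs to one school and teaches one class type
students$school_id <- ifelse(students$teacher_id %in% c("T1", "T3", "T5"), "S1", "S2")
class_type <- c(T1 = "small", T2 = "small", T3 = "regular",
                T4 = "regular", T5 = "regular+aide", T6 = "regular+aide")
students$star1 <- unname(class_type[students$teacher_id])

# Collapse to one row per teacher
teachers <- students %>%
  group_by(teacher_id, school_id, star1) %>%
  summarise(
    n_students = n(),
    mean_math1 = mean(math1, na.rm = TRUE),
    median_math1 = median(math1, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped data frame without the grouping message
  )
print(teachers)
```

Keeping `n_students` in the result lets you check later whether class size varies within a class type.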
The summarise() function is helpful here (link).
Multivariate descriptive statistics for the outcome (the chosen summary measure for each teacher) with key variables (e.g., class types, school IDs).
Plots can be made with ggplot2 (link).
We can define a two-way ANOVA model as follows: \(Y_{ijk} = \mu_{..} + \alpha_{i} + \beta_{j} + \epsilon_{ijk}\), where the index \(i\) represents the class type: small (\(i=1\)), regular (\(i=2\)), regular with aide (\(i=3\)), and the index \(j\) represents the school indicator. You need to explain the rest of the parameters, state constraints on the parameters, and justify the choice of model (e.g., why no interaction terms).
The proposed model is a two-way ANOVA model. You can find the assumptions easily from the course notes or read the wiki page on ANOVA. State these assumptions and try to explain them in the context of Project STAR. You can find assumptions for the regression model in a similar manner.
You can fit the ANOVA model using aov() in R (or lm() for the regression version). Report the fitted results, paying attention to how (or whether) to report the estimated coefficients for school IDs.
The null hypothesis for the primary question of interest is \(H_0 : \alpha_1 = \alpha_2 = \alpha_3 = 0\), and the alternative is \(H_a\) : not all \(\alpha\)s are zero. You can find the test statistic and p-value using summary(anova.fit), if you save your fitted model as anova.fit. Please be sure to specify the significance level and interpret your test result. Explain any additional assumptions involved in this test.
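A minimal sketch of the fit and the F-test, run on simulated teacher-level data. The data frame `teachers` and its columns `mean_math1`, `star1`, and `school_id` are assumptions for illustration, and the significance level of 0.05 is only an example choice.

```r
# Toy teacher-level data: one class of each type in each of 8 schools.
# Column names and values are illustrative assumptions.
set.seed(3)
teachers <- expand.grid(
  star1 = c("small", "regular", "regular+aide"),
  school_id = paste0("S", 1:8)
)  # expand.grid() returns both columns as factors
teachers$mean_math1 <- 530 + 10 * (teachers$star1 == "small") +
  rnorm(nrow(teachers), sd = 15)

# Additive two-way ANOVA: class type plus school, no interaction
anova.fit <- aov(mean_math1 ~ star1 + school_id, data = teachers)

# ANOVA table; rows are star1, school_id, Residuals (in that order)
anova.tab <- summary(anova.fit)[[1]]
print(anova.tab)

# F-test for class type at a pre-specified significance level
alpha <- 0.05
p.class <- anova.tab[["Pr(>F)"]][1]
cat("p-value for class type:", signif(p.class, 3),
    if (p.class < alpha) "(reject H0)" else "(fail to reject H0)", "\n")
```

Note that `aov()` reports sequential sums of squares, so with unbalanced data the test for class type depends on the order in which the terms enter the formula.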
For the secondary question of interest, one option is Tukey’s range test (link). Again, specify the significance level, interpret your test result, and explain any additional assumptions involved in this test.
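Tukey’s range test is available in base R as `TukeyHSD()`. The sketch below applies it to a model fitted on simulated data; the data frame `teachers`, its column names, and the 95% family-wise confidence level are assumptions for illustration.

```r
# Toy teacher-level data; column names and values are illustrative assumptions.
set.seed(3)
teachers <- expand.grid(
  star1 = c("small", "regular", "regular+aide"),
  school_id = paste0("S", 1:8)
)
teachers$mean_math1 <- 530 + 10 * (teachers$star1 == "small") +
  rnorm(nrow(teachers), sd = 15)

anova.fit <- aov(mean_math1 ~ star1 + school_id, data = teachers)

# Pairwise comparisons among class types only, at a 95% family-wise level
tukey <- TukeyHSD(anova.fit, which = "star1", conf.level = 0.95)
print(tukey)

# Each row is one pair: estimated difference, interval bounds, adjusted p-value
pairs <- tukey$star1
print(pairs[, c("diff", "lwr", "upr", "p adj")])
```

An interval that excludes zero indicates a difference between that pair of class types at the chosen family-wise level.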
Examine the residuals of the fitted model. If you save the fitted ANOVA model as anova.fit, four elementary diagnostic plots are available via plot(anova.fit). You need to explain your findings in these plots (e.g., whether assumptions seem to hold).
You can find tests for some assumptions by searching the key words “test” and the corresponding assumptions. For instance, to test the equal variance assumption, there exist an F-test and Levene’s test.
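The diagnostics above can be sketched as follows, again on simulated data with hypothetical column names. Levene’s test is computed here by hand as a one-way ANOVA on absolute deviations from the group medians (the median-centered variant); `leveneTest()` in the car package is a packaged alternative. The Shapiro-Wilk test is shown as one possible check of normality, not as a required method.

```r
# Toy teacher-level data; column names and values are illustrative assumptions.
set.seed(3)
teachers <- expand.grid(
  star1 = c("small", "regular", "regular+aide"),
  school_id = paste0("S", 1:8)
)
teachers$mean_math1 <- 530 + 10 * (teachers$star1 == "small") +
  rnorm(nrow(teachers), sd = 15)

anova.fit <- aov(mean_math1 ~ star1 + school_id, data = teachers)

# Four standard diagnostic plots on one page
op <- par(mfrow = c(2, 2))
plot(anova.fit)
par(op)

# Levene's test across class types, median-centered:
# one-way ANOVA on absolute deviations from each group's median
teachers$abs.dev <- abs(teachers$mean_math1 -
                          ave(teachers$mean_math1, teachers$star1, FUN = median))
levene.tab <- anova(lm(abs.dev ~ star1, data = teachers))
print(levene.tab)

# Shapiro-Wilk test of normality on the model residuals
sw <- shapiro.test(residuals(anova.fit))
print(sw)
```

Formal tests and plots can disagree, especially with few observations, so it is worth reporting both.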
For alternative methods, you can explore
As an example for the creativity category in the grading rubric, you can investigate the plausibility of making causal statements, such as that smaller class sizes lead to better performance. Discuss the assumptions for causal interpretation and whether they are plausible in Project STAR. See, for instance, Chapter 9 in Imbens and Rubin (2015).
Conclude your analysis in this section. You can touch on the following topics.
By default, it is assumed that you have discussed this project with your teammates and instructors. List any other people that you have discussed this project with.
List any references you cited in the report. See here for the APA format.
Imbens, G., & Rubin, D. (2015). Stratified Randomized Experiments. In Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction (pp. 187-218). Cambridge: Cambridge University Press. doi:10.1017/CBO9781139025751.010
Report information of your R session for reproducibility.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
## [5] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 tidyverse_1.3.0
## [9] plotly_4.9.2.1 ggplot2_3.3.2 AER_1.2-9 survival_3.2-7
## [13] sandwich_3.0-0 lmtest_0.9-38 zoo_1.8-8 car_3.0-10
## [17] carData_3.0-4
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.2 jsonlite_1.7.1 viridisLite_0.3.0 splines_4.0.3
## [5] modelr_0.1.8 Formula_1.2-4 assertthat_0.2.1 cellranger_1.1.0
## [9] yaml_2.2.1 pillar_1.4.7 backports_1.2.0 lattice_0.20-41
## [13] glue_1.4.2 digest_0.6.27 RColorBrewer_1.1-2 rvest_0.3.6
## [17] colorspace_2.0-0 htmltools_0.5.0 Matrix_1.2-18 pkgconfig_2.0.3
## [21] broom_0.7.2 haven_2.3.1 scales_1.1.1 openxlsx_4.2.3
## [25] rio_0.5.16 farver_2.0.3 generics_0.1.0 ellipsis_0.3.1
## [29] withr_2.3.0 lazyeval_0.2.2 cli_2.2.0 magrittr_2.0.1
## [33] crayon_1.3.4 readxl_1.3.1 evaluate_0.14 fs_1.5.0
## [37] fansi_0.4.2 xml2_1.3.2 foreign_0.8-80 tools_4.0.3
## [41] data.table_1.13.2 hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0
## [45] reprex_1.0.0 zip_2.1.1 compiler_4.0.3 rlang_0.4.8
## [49] grid_4.0.3 rstudioapi_0.13 htmlwidgets_1.5.2 crosstalk_1.1.0.1
## [53] labeling_0.4.2 rmarkdown_2.5 gtable_0.3.0 abind_1.4-5
## [57] DBI_1.1.0 curl_4.3 R6_2.5.0 lubridate_1.7.9.2
## [61] knitr_1.30 stringi_1.5.3 Rcpp_1.0.5 vctrs_0.3.5
## [65] dbplyr_2.0.0 tidyselect_1.1.0 xfun_0.19